Checking for missing files in parallel #224
Codecov Report
@@ Coverage Diff @@
## master #224 +/- ##
========================================
- Coverage 87.0% 87.0% -0.1%
========================================
Files 29 29
Lines 1930 1937 +7
========================================
+ Hits 1681 1686 +5
- Misses 249 251 +2
Code looks great and 5-6 minutes instead of 2 hours is a fantastic improvement!
I agree it's worth testing with local files as well. Can you do that? You can mock it by downloading one sample video, making 100 or so copies (a little bash script would do), and then creating a simple labels file, e.g.
filepath,label
vid1.mp4,gorilla
vid2.mp4,gorilla
vid3.mp4,gorilla
vid4.mp4,gorilla
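For anyone reproducing this local test, here is a minimal sketch of the mock dataset described above. It is written in Python rather than bash to match the project language; the function name `make_mock_dataset`, the `labels.csv` file name, and the default of 100 copies are assumptions for illustration, not part of the PR.

```python
import csv
import shutil
from pathlib import Path


def make_mock_dataset(sample_video, out_dir, n_copies=100, label="gorilla"):
    """Copy one video n_copies times and write a matching labels CSV.

    Hypothetical helper: the names and defaults here are illustrative only.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    labels_path = out_dir / "labels.csv"

    with open(labels_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filepath", "label"])
        for i in range(1, n_copies + 1):
            name = f"vid{i}.mp4"
            # Every copy is byte-identical to the sample; only the name differs.
            shutil.copyfile(sample_video, out_dir / name)
            writer.writerow([name, label])

    return labels_path
```

The file paths in the CSV are written relative to `out_dir`, as in the example above, so the check would need to be run from that directory or be given it as the data directory.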
Confirmed that it works with 22 local files (a convenience sample of videos)
Thanks @AllenDowney!
Closes #216
Checking for missing files is slow with goofys. It seems to make lots of small queries to the file system. Running them in parallel with `pqdm` is much faster. The speed depends on the state of the file system cache, but we can check 246,000 files in 5-8 minutes, compared to about two hours the slow way.
Using `pqdm` with threads is faster than with processes. Using 16 threads seems to be fast and robust. With more threads, things go faster, but you start to see unpredictable I/O errors.
If an error occurs, it falls back to the slow way.
This fix has only been tested with video files that are mounted from S3 using goofys. It might be good to test with videos stored in a local file system, too.
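The approach described above can be sketched roughly as follows. This is not the code from this PR: it swaps in the standard library's `ThreadPoolExecutor` for `pqdm` so the example has no third-party dependency, and the function name `find_missing_files` is an assumption. It keeps the two behaviors the description mentions, a pool of 16 threads and a fallback to a serial check if the parallel pass raises an error.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def find_missing_files(filepaths, n_threads=16):
    """Return the entries of filepaths that do not exist on disk.

    Hypothetical sketch: existence checks run in a thread pool, since each
    check is a small I/O-bound query to the file system.
    """
    filepaths = list(filepaths)
    paths = [Path(p) for p in filepaths]

    try:
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            # pool.map preserves input order and re-raises worker exceptions
            # when its results are consumed.
            exists = list(pool.map(Path.exists, paths))
    except OSError:
        # Fall back to the slow way: check each path one at a time.
        exists = [p.exists() for p in paths]

    return [fp for fp, ok in zip(filepaths, exists) if not ok]
```

Threads rather than processes fit here because the work is waiting on the file system, not computing, which is consistent with the observation that threads were faster.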